## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
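The `stat_bin()` message above appears whenever a ggplot2 histogram is drawn without an explicit bin setting. A minimal sketch of silencing it by choosing a `binwidth`, assuming a data frame named `rw` (the name used in the model calls later in this report); the values here are simulated stand-ins, and the bin width of 0.1 is only an illustrative choice:

```r
library(ggplot2)

# Simulated stand-in for the alcohol column of `rw`; not the real data.
set.seed(1)
rw <- data.frame(alcohol = round(runif(200, min = 8.4, max = 14.9), 1))

# Setting binwidth explicitly replaces the default of 30 bins,
# so stat_bin() no longer prints its message.
p <- ggplot(rw, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1)
```

Printing `p` draws the histogram. A suitable `binwidth` depends on the units and range of each variable, so it would need to be chosen per plot.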
The data set contains 1,599 observations of 11 input attributes plus 1 output attribute. The X column is only a row index (1 to 1599).
Wine quality is the output of interest. It is scored on a 0-10 scale, with 10 being the best; in this sample the scores range from 3 to 8.
The 11 physicochemical inputs are candidate predictors of wine quality.
Yes. I applied a log10 transformation to create new variables for the long-tailed distributions of residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates.
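A minimal sketch of how the transformed variables could be created in base R. The data frame name `rw` and the column name `log.sulphates` come from the model output later in this report; the other `log.` names are assumed to follow the same pattern, and the four rows below are illustrative stand-ins rather than the real data:

```r
# Illustrative stand-in for `rw`; the real data frame has 1599 rows.
rw <- data.frame(
  residual.sugar       = c(1.9, 2.2, 2.6, 15.5),
  chlorides            = c(0.070, 0.079, 0.090, 0.611),
  free.sulfur.dioxide  = c(7, 14, 21, 72),
  total.sulfur.dioxide = c(22, 38, 62, 289),
  sulphates            = c(0.55, 0.62, 0.73, 2.00)
)

# Variables with long right tails, as listed in the text.
long.tailed <- c("residual.sugar", "chlorides", "free.sulfur.dioxide",
                 "total.sulfur.dioxide", "sulphates")

# Add one log10-transformed column per variable, named "log.<variable>".
# (Only log.sulphates is confirmed by the report; the rest are assumed.)
for (v in long.tailed) {
  rw[[paste0("log.", v)]] <- log10(rw[[v]])
}
```

The original columns are kept, so plots and models can use either scale.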
The data is clean and well structured otherwise and does not require additional wrangling beyond the optional transformations listed above.
Note: I originally used the ggpairs function for this section. The knitted HTML file presents the equivalent plot from the psych package instead, because its formatting is better.
The highest correlation was between free sulfur dioxide and total sulfur dioxide: after taking the log10 of each, the correlation was 0.785. This is to be expected, as both are measures of sulfur dioxide.
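A sketch of the correlation calculation on the log10 scale. The data are simulated, not the wine data, and rest on the assumption that total sulfur dioxide is the sum of a free and a bound component, which is why the two measures move together:

```r
# Simulated stand-in data (assumption: total = free + bound).
set.seed(1)
free  <- rlnorm(200, meanlog = 2.5, sdlog = 0.6)
bound <- rlnorm(200, meanlog = 3.0, sdlog = 0.7)
so2 <- data.frame(free.sulfur.dioxide  = free,
                  total.sulfur.dioxide = free + bound)

# Pearson correlation of the raw values and of the log10 values.
r.raw <- cor(so2$free.sulfur.dioxide, so2$total.sulfur.dioxide)
r.log <- cor(log10(so2$free.sulfur.dioxide),
             log10(so2$total.sulfur.dioxide))
```

On the real data, the same `cor()` call on the two log10 columns would be expected to reproduce the 0.785 reported above.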
The variables most useful for determining wine quality appear to be volatile acidity, citric acid, pH, and alcohol.
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
##
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = rw)
## m2: lm(formula = quality ~ volatile.acidity + citric.acid, data = rw)
## m3: lm(formula = quality ~ volatile.acidity + citric.acid + alcohol,
## data = rw)
## m4: lm(formula = quality ~ volatile.acidity + citric.acid + alcohol +
## log.sulphates, data = rw)
##
## ================================================================
##                     m1          m2          m3          m4
## ----------------------------------------------------------------
## (Intercept)         6.566***    6.529***    3.055***    3.444***
##                    (0.058)     (0.089)     (0.194)     (0.196)
## volatile.acidity   -1.761***   -1.723***   -1.343***   -1.217***
##                    (0.104)     (0.125)     (0.114)     (0.112)
## citric.acid                     0.063       0.068      -0.113
##                                (0.115)     (0.103)     (0.103)
## alcohol                                     0.314***    0.303***
##                                            (0.016)     (0.016)
## log.sulphates                                           1.518***
##                                                        (0.181)
## ----------------------------------------------------------------
## R-squared           0.153       0.153       0.317       0.346
## adj. R-squared      0.152       0.152       0.316       0.344
## sigma               0.744       0.744       0.668       0.654
## F                 287.444     143.812     246.976     210.808
## p                   0.000       0.000       0.000       0.000
## Log-likelihood  -1794.312   -1794.160   -1621.596   -1587.153
## Deviance          883.198     883.030     711.603     681.597
## AIC              3594.624    3596.320    3253.192    3186.306
## BIC              3610.756    3617.828    3280.078    3218.569
## N                1599        1599        1599        1599
## ================================================================
I initially produced single plots comparing alcohol, citric acid, and volatile acidity, all colored by quality. Because of the large number of “average” quality wines, it was difficult to pick out visual cues. I decided to focus on the characteristics of higher quality wines (7-8) compared with lower quality wines (3-4). I subset the data accordingly and produced pairs of graphs with similar scales.
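The split described above can be sketched in base R as follows. The cut-offs (3-4 and 7-8) come from the text; the eight rows are an illustrative stand-in for `rw`, and the shared axis range is just one way to keep the paired plots on similar scales:

```r
# Illustrative stand-in for `rw`; not the real data.
rw <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.4, 9.5, 10.0, 10.2, 10.8, 11.5, 12.6)
)

# Lower quality wines (3-4) and higher quality wines (7-8).
low.quality  <- subset(rw, quality <= 4)
high.quality <- subset(rw, quality >= 7)

# Axis limits taken from the full data, so both plots share one scale.
alcohol.range <- range(rw$alcohol)
```

Passing `alcohol.range` as the axis limits of both plots keeps them directly comparable.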
The lower quality wines in the sample have lower alcohol content and higher volatile acidity than the higher quality wines. However, the alcohol content of the higher quality wines appears to be more evenly dispersed. The lower quality wines also combine a lower citric acid content with their lower alcohol content and higher volatile acidity.
It was interesting to see the lack of interaction between alcohol and citric acid on the good wines. Good wines exist across most points in this plot.
I fit a linear model to try to predict wine quality. A model using volatile acidity, citric acid, alcohol, and sulphates (log10) had an R-squared of only 0.346, so it explains only about 35% of the variance in quality and has little predictive power.
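A minimal sketch of fitting nested linear models and comparing their R-squared values, in the spirit of m1 to m4 above. The data are simulated with made-up coefficients, so the numbers it produces will not match the table; only the workflow is meant to carry over:

```r
# Simulated stand-in data; the coefficients below are assumptions.
set.seed(42)
n <- 500
sim <- data.frame(
  volatile.acidity = runif(n, min = 0.12, max = 1.58),
  alcohol          = runif(n, min = 8.4,  max = 14.9)
)
sim$quality <- 4 - 1.2 * sim$volatile.acidity + 0.3 * sim$alcohol +
  rnorm(n, sd = 0.65)

# Start with one predictor, then add a second to the same model.
m1 <- lm(quality ~ volatile.acidity, data = sim)
m2 <- update(m1, . ~ . + alcohol)

# R-squared never decreases when a predictor is added to a nested
# least-squares model, so adjusted R-squared is the fairer comparison.
r2     <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj.r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
```

`anova(m1, m2)` would additionally give an F-test of whether the added predictor improves the fit.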
The vast majority of wines are average quality. It may be difficult to draw conclusions from such a narrow data set.
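One way to quantify how concentrated the ratings are is to tabulate the quality scores and compute the share of "average" wines. This sketch treats scores of 5 and 6 as average, which is an assumption based on the 3-4 and 7-8 groupings used earlier, and uses a made-up vector of ten scores:

```r
# Made-up quality scores for illustration only.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8)

# Count of wines at each score.
counts <- table(quality)

# Share of wines rated 5 or 6 ("average" under the assumption above).
share.average <- mean(quality %in% c(5, 6))
```

For this toy vector the share is 0.6; on the real data the same expression would be applied to `rw$quality`.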
This plot shows a clear inverse relationship between volatile acidity and quality. Both the median and the spread of volatile acidity decrease as quality increases.
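The pattern in the plot can also be checked numerically by computing, for each quality score, the median of volatile acidity and a measure of its spread (here the interquartile range, an assumed choice). The nine rows below are invented to illustrate the calculation:

```r
# Invented stand-in data; not the real wine measurements.
rw <- data.frame(
  quality          = c(4, 4, 4, 6, 6, 6, 8, 8, 8),
  volatile.acidity = c(0.50, 0.70, 1.10, 0.40, 0.50, 0.70, 0.30, 0.35, 0.45)
)

# Median and interquartile range of volatile acidity per quality score.
acidity.median <- tapply(rw$volatile.acidity, rw$quality, median)
acidity.iqr    <- tapply(rw$volatile.acidity, rw$quality, IQR)
```

If both vectors fall as the quality score rises, the numbers agree with what the plot suggests.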
## Warning: Removed 1 rows containing missing values (geom_point).
For the lower quality wines, there are few redemptive properties. They are lower in citric acid, so they don’t have a taste as fresh and fruity as higher quality wines. But they also have a lower alcohol content. It would take more of this unpleasant beverage to achieve the desired level of “relaxation”.
For the higher quality wines, the plot of citric acid and alcohol is more evenly distributed. From a statistical point of view, this indicates little relationship between these variables at these quality levels. Perhaps other variables have stronger relationships with higher quality wines. However, from a practical standpoint, it could also mean that there are pleasant wines across many points in these scales. Personal preference may impact quality ratings.
The wine quality data set looked at the chemical properties of 1599 red wines. I produced univariate, bivariate, and multivariate plots to attempt to find the variables with the highest impact on wine quality.
One area of frustration was that nothing really “jumped out”. Many of the plots did not indicate any relationship with wine quality. It was hard to feel confident in choosing any direction for further analysis. It was also difficult to draw many conclusions from this data set since over 80% of the wines are “average” quality. Any fitted models would likely have limited predictive power. It was also challenging to model what is really a discrete or categorical variable.
During the multivariate analysis, I felt most confident in the analysis that split the data into higher and lower quality wines. That unconventional decision finally seemed to produce something interesting. The obvious drawback to this approach is that it discards most of the data. Splitting the data this way could still be useful for future analysis.
As for future analysis, modeling quality as a categorical dependent variable seems to be a better approach. Additional research would be needed to determine how to implement this in R. This would also be a good opportunity to consider additional variables in the model to attempt to improve predictive power.
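One candidate for modeling an ordered categorical outcome in R is proportional-odds logistic regression, available as `polr()` in the MASS package. This is only a sketch of that option, not an approach used in this analysis: the data are simulated, the predictors are standardized stand-ins, and the three quality bands are an assumption:

```r
library(MASS)  # provides polr()

# Simulated stand-in data; coefficients and cut points are assumptions.
set.seed(7)
n <- 400
sim <- data.frame(alcohol = rnorm(n), volatile.acidity = rnorm(n))
latent <- 1.0 * sim$alcohol - 0.8 * sim$volatile.acidity + rlogis(n)

# Ordered outcome with three bands instead of a numeric score.
sim$quality <- cut(latent, breaks = c(-Inf, -1.5, 1.5, Inf),
                   labels = c("low", "average", "high"),
                   ordered_result = TRUE)

# Proportional-odds model; Hess = TRUE keeps the Hessian for summary().
fit <- polr(quality ~ alcohol + volatile.acidity, data = sim, Hess = TRUE)

# In-sample share of wines assigned to their observed band.
pred     <- predict(fit, newdata = sim, type = "class")
accuracy <- mean(as.character(pred) == as.character(sim$quality))
```

On the real data the outcome could be the ordered quality score itself or bands such as 3-4, 5-6, and 7-8. In-sample accuracy is optimistic, so a held-out set would give a fairer estimate.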